Transmitted HIV-1 is more virulent in heterosexual individuals than men-who-have-sex-with-men

Transmission bottlenecks introduce selection pressures on HIV-1 that vary with the mode of transmission. Recent studies on small cohorts have suggested that stronger selection pressures lead to fitter transmitted/founder (T/F) strains. Manifestations of this selection bias at the population level have remained elusive. Here, we analysed early CD4 cell count measurements reported from ∼340,000 infected heterosexual individuals (HET) and men-who-have-sex-with-men (MSM), across geographies, ethnicities and calendar years. The reduction in CD4 counts early in infection is reflective of the virulence of T/F strains. MSM and HET use predominant modes of transmission, namely, anal and penile-vaginal, with among the largest differences in the selection pressures at transmission across modes. Further, in most geographies, the groups show little inter-mixing, allowing for the differential selection bias to be sustained and amplified. We found that the early reduction in CD4 counts was consistently greater in HET than MSM (P<0.05). To account for inherent variations in baseline CD4 counts, we constructed a metric to quantify the extent of progression to AIDS as the ratio of the reduction in measured CD4 counts from baseline to the reduction associated with AIDS. We found that this progression corresponding to the early CD4 measurements was ∼68% for MSM and ∼87% for HET on average (P<10^{-4}; Cohen’s d, d_s = 0.36), reflecting the more severe disease caused by T/F strains in HET than MSM at the population level. Interestingly, the set-point viral load was not different between the groups (d_s<0.12), suggesting that MSM were more tolerant and not more resistant to their T/F strains than HET. This difference remained when we controlled for confounding factors using multivariable regression. We concluded that the different selection pressures at transmission have resulted in more virulent T/F strains in HET than MSM. These findings have implications for our understanding of HIV-1 pathogenesis, evolution, and epidemiology.


Introduction
The bottlenecks in HIV-1 transmission result in a 'selection bias' favoring fitter transmitted/founder (T/F) viruses over less fit ones [1,2]. Several recent studies have presented evidence of genetic, phenotypic, and clinical manifestations of the selection bias in small cohorts [1,3-6]. The evidence is based on different attributes of fitness, each contributing to the establishment of infection and progressive disease. For instance, from 137 heterosexual (HET) donor-recipient pairs, T/F viruses were found to carry higher than average frequencies of amino acids associated with high in vivo fitness, in terms of protein stability, immune escape and compensation [1]. Similarly, from 127 discordant couples, higher viral replication capacity (vRC) early in infection was associated with faster decline of CD4 T cell counts [4].
The selection bias varies with the mode of transmission [3]. The stronger the bottlenecks, the fitter the corresponding T/F viruses are likely to be [1,2]. Anal intercourse is over 10-fold more permissive on average than penile-vaginal intercourse [7]. Analysis of T/F genomes from 131 subjects revealed that the T/F genomes were under greater positive selection in heterosexual individuals (HET), in whom the penile-vaginal mode predominates [8], than homosexual men, or men-who-have-sex-with-men (MSM), who transmit predominantly through anal intercourse [3]. Among HET, men had T/F viruses with higher predicted fitness in vivo than women [1], consistent with the asymmetry of the bottlenecks between insertive and receptive penile-vaginal intercourse [7].
An important question that follows is whether the differential selection bias across modes of transmission is manifested at the wider population level, extending beyond the restricted cohorts examined in the trials above. Such differential bias could contribute to variations in disease progression and treatment outcomes and underlie the diverse trajectories of the HIV-1 pandemic across infected groups in which different modes of transmission predominate. To answer this question, a marker of the manifestation of the fitness of the T/F viruses that is readily measured across large populations is necessary. Furthermore, infected groups must be identified in which the predominant modes of transmission have substantial differences in the associated bottlenecks, so that the implications of the selection bias are detectable with statistical significance. Here, we identified CD4 T cell counts measured early in infection as a suitable marker meeting the above criteria and MSM and HET as the relevant risk groups. We collated early CD4 count measurements in these groups across large populations and in different geographies and calendar years and analyzed them to deduce the impact of the differential selection bias across modes of transmission at the population level.

A marker and risk groups for assessing population-level transmission bias
Immediately following infection, CD4 T cell counts fall steeply, recover partially, and then settle within a few weeks/months to a value smaller than in the pre-infection state [9] (Fig 1A). Subsequent changes in the CD4 counts occur slowly, over many months to years. Thus, CD4 count measurements made early in infection tend to be close to the value to which the counts settle after the initial dynamics. These early CD4 counts are expected to be minimally affected by host-specific adaptive mutations [1] and, therefore, reflective of the effects of the T/F strains. The CD4 count is a key indicator of the severity of disease: the lower the CD4 count, the more severe the disease [9]. A pathogen that causes more severe disease is said to be more virulent [10,11]. Fitter T/F strains tend to be more virulent; in the above data from 127 acutely infected individuals, high vRC of the T/F viruses, corresponding to high viral fitness, was associated with low CD4 counts at 3 months post-infection (which roughly coincides with the time of seroconversion) [4]. We reasoned, therefore, that fitter T/F strains would lead to lower early CD4 counts.
HET and MSM are the two major risk groups driving the global HIV-1 epidemic [9]. They use predominant modes of transmission with a substantial difference in the associated selection bias [7]. Importantly, they display little inter-mixing in most geographical regions. We inferred the latter from the distinct prevalence of HIV-1 subtypes in the two groups, which we found across calendar years and geographical regions (Fig 1B and 1C and S1 and S2 Tables): MSM in western nations are dominated by HIV-1 subtype B, whereas HET comprise a mixture of subtypes [12], with subtypes B and C the predominant ones [13]. For instance, in the United Kingdom, from 2002-2010, MSM had nearly 90% subtype B infections, whereas HET had a little over 10% subtype B. Mixing between the two groups would have led to a more similar distribution of subtypes in the two groups. The two groups thus appear to have remained largely segregated. The difference in subtype prevalences holds also in Canada, Spain, France, and other nations [14-20] (Fig 1B and S1 Table). In China, the dominant subtype is CRF01-AE, which is present at a frequency of >50% in MSM but at <40% in HET (Fig 1C and S2 Table) [21], perhaps indicative of more mixing than in Europe. (In Korea, the extent of mixing could not be ascertained using subtypes because over 80% of all infections were subtype B [22]. We therefore did not include data from Korea [23] in our analysis.) In the USA, though subtype B dominates both MSM and HET [24,25], mixing between the groups has been argued not to be common [26]. Overall, thus, little mixing between MSM and HET is evident in most geographical settings.
Together, these characteristics allow for the difference in the selection bias between the two groups to be sustained long-term, potentially amplified, and manifested in sample sizes large enough for detection with statistical significance. We thus hypothesized that the stronger selection bias associated with penile-vaginal transmission than anal transmission would result in fitter, more virulent T/F strains and, hence, lower early CD4 counts in HET than in MSM.

CD4 counts early in infection in HET and MSM
To test this hypothesis, we collated available data of CD4 count measurements either at seroconversion or at diagnosis from all large studies [19,27-31] (n ≳ 1,000), which amounted to a total of ∼340,000 patients across four geographical regions, viz., China, Europe, the UK, and the US, followed over a total period of nearly four decades, and examined the differences between HET and MSM (Methods; Table 1 and S3-S5 Tables). Because sample sizes were large, we employed measures of centrality (such as mean and median) and supplemented significance estimates (P values) with effect size measures (Cohen's d, denoted d_s) for our analysis (Methods). Individual-patient data was not reported in these studies and was not necessary for our inferences. We found that HET consistently had lower CD4 counts than MSM (Fig 2 and

PLOS PATHOGENS
Virulence of transmitted/founder HIV-1 in different risk groups

Remarkably, although the effect sizes varied across studies, we did not find any large study that reported higher early CD4 cell counts in HET than MSM.
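Because only summary statistics (means, SDs, sample sizes) were available from the studies, an effect size such as Cohen's d_s has to be computed from those summaries. A minimal sketch follows, assuming the standard pooled-SD definition of d_s for two independent groups; the exact procedure in the Methods may differ, and the function name and all numbers below are hypothetical illustrations, not values from the studies.

```python
import math


def cohens_d_s(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d_s for two independent groups, from summary statistics.

    Assumes the standard pooled-SD definition (an assumption of this sketch).
    """
    # Pool the two variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# Hypothetical early CD4 summaries (cells/uL): mean, SD, sample size per group.
d = cohens_d_s(450.0, 250.0, 5000, 400.0, 250.0, 5000)
print(f"d_s = {d:.2f}")
```

With very large samples, even small mean differences yield tiny P values, which is why an effect size that does not grow with sample size is a useful complement.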

Relative reduction in CD4 counts early in infection
While the evidence from absolute CD4 count comparisons was thus strong, differences in CD4 counts in healthy (uninfected) individuals across sex, ethnicity and geographical regions could render absolute CD4 counts only an approximate measure of the virulence of the T/F strains. Two individuals may have similar early CD4 counts but may still have been infected by T/F strains of different fitness if their pre-infection CD4 counts were different, with the individual with the higher pre-infection count infected by the more virulent T/F strain. To overcome this limitation, we constructed a metric to quantify the relative reduction in the CD4 cell count, R, corresponding to the absolute CD4 count, T, as

R = \frac{T_{healthy} - T}{T_{healthy} - T_{AIDS}} \times 100,

where T_{healthy} was the count pre-infection and T_{AIDS} = 200 cells/μL the count defining AIDS. R thus represented the reduction in CD4 counts, ΔCD4, relative to the reduction signifying AIDS, ΔCD4_{AIDS}. Accordingly, R was 0% when T = T_{healthy} and 100% when T = T_{AIDS} and decreased linearly with T between these extremes. R was thus a more reliable indicator of disease severity than absolute CD4 counts. To use this metric, we collated measurements of T_{healthy} specific to the respective geographies, ethnicities, and sexes [32,35-39] (S6 Table).
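The metric R can be sketched in a few lines. The example below illustrates the point made above: two individuals with the same early CD4 count but different pre-infection counts have different relative reductions. The function name and the baseline counts of 800 and 1,000 cells/μL are hypothetical; only T_{AIDS} = 200 cells/μL comes from the text.

```python
T_AIDS = 200.0  # cells/uL, the AIDS-defining count stated in the text


def relative_reduction(t, t_healthy, t_aids=T_AIDS):
    """Relative reduction R (in %) of the CD4 count t from the healthy baseline.

    R is 0 when t equals t_healthy and 100 when t equals t_aids.
    """
    if t_healthy <= t_aids:
        raise ValueError("healthy baseline must exceed the AIDS-defining count")
    return (t_healthy - t) / (t_healthy - t_aids) * 100.0


# Same early count (500 cells/uL), two hypothetical pre-infection baselines.
for baseline in (800.0, 1000.0):
    r = relative_reduction(500.0, baseline)
    print(f"T_healthy = {baseline:.0f}: R = {r:.1f}%")
```

The individual with the higher baseline shows the larger R, consistent with having lost a larger share of the CD4 count that separates health from AIDS.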
Using the latter data, we estimated R corresponding to the early cell count measurements above, which we denoted as R_{T/F}, indicative of the relative reduction in CD4 count due to the T/F virus (Fig 3 and

In some studies, data was available separately for HET men and women, allowing a comparison between HET men and MSM, thus eliminating potential confounding effects of sex (Table 1). In EU/EEA, during 2010-18, R_{T/F} in HET men was 90%, much higher than the 82.9% in HET women, indicating that the difference between HET and MSM was amplified upon eliminating the effect of sex. We recall that R_{T/F} was 66.2% in MSM during the same period, significantly smaller than HET men (P<10^{-4}; d_s = 0.63) and HET women (P<10^{-4}; d_s = 0.20). Among the seroconverters in Europe and Australia, in those aged below 40 years, R_{T/F} in HET men was 46.4%, higher than the 40% in MSM (P = 0.01; d_s = 0.12). The difference was similar in those aged above 40 years; R_{T/F} was 52.4% in HET men and 46.2% in MSM (P = 0.073; d_s = 0.16). Thus, in these comparisons too, where the effects of sex, age, and diagnosis delay were eliminated, HET had a consistently higher R_{T/F} than MSM.

Overall, thus, R_{T/F} comparisons showed more significant differences between MSM and HET than absolute CD4 count comparisons (see Figs 2 and 3). Further, R_{T/F} allowed comparison across the different datasets. Thus, while the HET all had R_{T/F} >85% at diagnosis, the MSM displayed a range from ∼65% to a little under 80%. This was comprehensive evidence of the greater virulence of the T/F virus in HET than MSM.

Set-point viral load
To understand this finding further, we recognized that virulence can have pathogen load-dependent and -independent components [10,11]. If pathogen load-dependent components were predominant, HET would have higher pathogen loads on average than MSM, which would then explain the higher R_{T/F}. The set-point viral load (SPVL) is established within weeks of infection and stays nearly constant for years [9], and is recognized as a good measure of the pathogen load in HIV-1 infection [10,11]. A subset of the above studies reported SPVL along with early CD4 counts. We found in the latter studies that the mean SPVL was similar in MSM and HET, with the effect sizes negligible and no consistent trend towards higher SPVL in HET (Table 2). For instance, the mean SPVL was 4.4 log_{10} copies/mL for both MSM and HET in the CASCADE study from 2006-2009 (P = 0.5; d_s = 0.00). SPVL thus could not explain the differences in R_{T/F} between MSM and HET. It is possible that viral loads in primary infection, before the establishment of SPVL, could be higher in HET than MSM. Measurements of viral load in primary infection are rare. A recent study did measure viral loads in primary infection in MSM and HET in a small sample (n ∼ 20 each) and found no difference between the groups (P = 0.34) [3]. While this finding needs to be established more widely, it suggests that the differences in R_{T/F} in the two groups are unlikely to have originated from differences in the pathogen load.

Per parasite pathogenicity of T/F strains
That disease severity can be decoupled from pathogen load has been recognized previously for HIV-1 [4,42]. A host that protects itself by suppressing the pathogen load is said to be 'resistant' to the pathogen [10]. In the context of HIV-1, greater resistance could arise, for instance, from stronger CD8 T cell responses, as suggested for elite controllers [9]. Our findings above of similar SPVL in the two groups thus suggest that MSM were not more resistant to HIV-1 than HET. A host that does not suffer disease despite high pathogen load is termed 'tolerant' to the pathogen [10]. Examples of such tolerance include HIV-1-infected viremic non-progressors, HIV-1-infected children, and SIV-infected sooty mangabeys [9,10,43]. MSM may thus be viewed as more tolerant to T/F strains than HET.
Hosts and pathogens can both evolve to achieve greater tolerance, as such evolution would benefit both [10]. Decoupling the contributions of the two to tolerance, however, is difficult. Nonetheless, the difference in tolerance between MSM and HET is a measure of the difference in the 'per parasite pathogenicity' of the T/F strains in the two groups. The per parasite pathogenicity measures the pathogen load-independent contribution of the pathogen to virulence [10,11]. To quantify the per parasite pathogenicity of the T/F strains, which we denoted P_{T/F}, we followed the procedure in earlier studies [11] (Methods). At any given SPVL, the deviation in R_{T/F} in an individual or group from that expected in the population would yield the P_{T/F} in that individual or group [11]. An R_{T/F} higher than expected would imply higher P_{T/F}. Because SPVL in the MSM and HET were close, we could directly compare R_{T/F} between the two and estimate the relative P_{T/F}. (Accounting for the small differences in SPVL between the groups did not alter our findings; see Table 2.) Specifically, for the CASCADE data mentioned above, R_{T/F} was 55.5% in HET and 44% in MSM (P ∼ 10^{-8}; d_s = 0.22; Table 2). The difference was a measure of the P_{T/F} in HET relative to MSM. This higher P_{T/F} in HET could thus be argued to have resulted in (11.5/55.5) × 100 ≈ 21% higher virulence in HET than MSM. Similarly, for the entire duration from 2003-2009 in the same study, the higher P_{T/F} corresponded to ∼17% higher virulence in HET than MSM. In the European study from 2002-2007, the corresponding estimate was ∼24% (Table 2). More broadly, in all the studies above (Table 1), the greater R_{T/F} in HET implied, assuming similar SPVL in the two groups, a greater overall P_{T/F} in HET than MSM. The greater virulence of the T/F strains was thus consistent with greater pathogenicity, reflective of the effects of the stronger selection bias at transmission, in HET than MSM.
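The simple estimate used above, which assumes equal SPVL in the two groups, reduces to a one-line calculation. The sketch below reproduces the CASCADE example with the R_{T/F} values quoted in the text; the function name is hypothetical, and the more detailed SPVL-adjusted procedure referred to in the Methods is not reproduced here.

```python
def relative_excess_virulence(r_het, r_msm):
    """Percent excess of R_T/F in HET over MSM, expressed relative to HET.

    Assumes similar SPVL in the two groups, so the whole difference is
    attributed to per parasite pathogenicity.
    """
    return (r_het - r_msm) / r_het * 100.0


# R_T/F values for the CASCADE data (2006-2009) as quoted in the text.
excess = relative_excess_virulence(55.5, 44.0)
print(f"{excess:.1f}%")  # 11.5 / 55.5 * 100, i.e. about 21%
```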

Confounding factors and regression analysis
Several factors could compromise our inference above of the differences in R T/F between HET and MSM being attributable to the differential selection bias at transmission. These factors include timing of onward transmission, diagnosis delay, HIV-1 subtype, ethnicity, sex, and age. We examined these factors next.

Early transmissions are more common among MSM than HET [44,45]. Strong evidence of this observation comes also from the greater association of MSM with transmission clusters, which we ascertained from numerous sources [12,18,46-54]. A transmission cluster comprises individuals carrying viral genomes that cluster together in a phylogenetic tree [55], suggesting that the viral sequences isolated from the individuals are closely related. In Japan and China, an infected MSM had a nearly 40% chance of being part of a cluster, whereas an infected HET had <10% chance [46]. In France, the corresponding numbers were ∼35% and ∼4%, respectively [50]. This trend was true for all the countries with data available except the Netherlands (Fig 4A). MSM also formed larger clusters than HET. The largest clusters reported in Belgium and Spain comprised nearly 100 individuals each, with the Belgian cluster containing ∼70 MSM and the Spanish cluster exclusively MSM (Fig 4B) [18,48]. Together, these data suggest greater similarity in the viral strains in MSM than HET. One way in which this greater similarity could arise is by onward transmission occurring sooner after infection in MSM than HET, allowing lesser individual host-specific adaptation before transmission. This should have led to higher R_{T/F} in MSM than HET, in contrast to our findings and ruling out the timing of onward transmission as a confounding factor.

Table 2. Set-point viral load and relative per parasite pathogenicity of T/F strains in HET and MSM. SPVL*, early CD4 cell counts, and the corresponding R_{T/F} in infected adults at seroconversion (from the CASCADE study [41]) or diagnosis (from the Europe study [19]). For the CASCADE data, the mean and SD were calculated from the median and 95% CIs (see Methods) obtained by digitizing using WebPlotDigitizer. The 95% CIs provided here are following the normal approximation. Information for the European study is in Table 1.

Columns: Region; Duration; Risk group; Sample size (n)

* Viral load in primary infection too did not show significant differences between the groups in a study that recently reported these measurements [3]. We extracted the reported viral load data, from 16 MSM and 17 HET in Fiebig stages II and III, all infected during 1990-02 with subtype B, and found that the mean viral loads (SD) were 6.47 (0.80) and 6.38 (0.62) log_{10} copies/mL in MSM and HET, respectively (P = 0.34).

$ We used a more detailed procedure to estimate P_{T/F} by accounting for the difference in SPVL between the groups following previous studies [10,11] (Methods). We found that the contributions varied from ∼10-25%, similar to the values estimated assuming no difference in the SPVL between the groups.

Fig 4 (see S8 Table). https://doi.org/10.1371/journal.ppat.1010319.g004

Although MSM tend to be diagnosed earlier than HET [30], the differences in R_{T/F} between the groups are seen also in CD4 counts at seroconversion [29,30] (Table 1), which would occur at similar times post infection in the two groups. Besides, in China, owing to social stigma, MSM may not get diagnosed earlier than HET [31], in which case the dependence of R_{T/F} on diagnosis delay in the two risk groups would be the opposite of what is expected from the corresponding selection bias. Diagnosis delay thus appeared not to be a major factor confounding our inference above. To ascertain this further, we quantified the effect of diagnosis delay on our estimates of the differences in R_{T/F} between MSM and HET. In the US, the median diagnosis delays during 2006-15 were ∼4 years and ∼5.4 years for MSM and HET, respectively [30,56]. Using the CD4 counts at diagnosis and their subsequent decline rates [30,56], we projected the CD4 counts in MSM to 5.4 years post seroconversion (Methods), which yielded R_{T/F} of 80.3%, substantially lower than that for HET of 85.8% at the same time (Table 1). Projecting the R_{T/F} of HET to 4 years post seroconversion and comparing with MSM yielded similar conclusions. Thus, diagnosis delay could explain only a part of the differences in R_{T/F} between MSM and HET at diagnosis. Indeed, as mentioned above, extrapolating all the way to seroconversion, thereby eliminating effects of diagnosis delay completely, did not eliminate the differences in R_{T/F} between MSM and HET (Table 1). Rather, the difference was amplified (R_{T/F} was 51% in MSM and 64.7% in HET; see Table 1). (Note that with time R_{T/F} in both MSM and HET would approach 100%, shrinking the difference between them and explaining the amplified difference at seroconversion.) Similarly, after extrapolating to seroconversion, substantial differences (∼20 percentage points) in R_{T/F} persisted in the UK and EU/EEA cohorts as well, ruling out diagnosis delay as a major confounding factor.
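The diagnosis-delay adjustment described above can be sketched as follows, under the simplifying assumption that CD4 counts decline linearly after diagnosis; the actual projection procedure is given in the Methods and may differ. The function names, the CD4 count at diagnosis, the decline rate, and the healthy baseline are all hypothetical placeholders. Only the two median delays (∼4 and ∼5.4 years) and T_{AIDS} = 200 cells/μL come from the text.

```python
T_AIDS = 200.0  # cells/uL, the AIDS-defining count stated in the text


def project_cd4(count_at_diagnosis, decline_per_year, extra_years):
    """Project a CD4 count forward in time, assuming linear decline."""
    return count_at_diagnosis - decline_per_year * extra_years


def relative_reduction(t, t_healthy, t_aids=T_AIDS):
    """Relative reduction R (in %) of the CD4 count t from the healthy baseline."""
    return (t_healthy - t) / (t_healthy - t_aids) * 100.0


# Median diagnosis delays in the US, 2006-15, as quoted in the text (years).
msm_delay, het_delay = 4.0, 5.4

# Hypothetical inputs, for illustration only (not values from the studies).
msm_count_at_diagnosis = 400.0  # cells/uL
msm_decline_per_year = 50.0     # cells/uL per year
t_healthy = 900.0               # cells/uL

# Move the MSM count to the same time post seroconversion as the HET count.
projected = project_cd4(msm_count_at_diagnosis, msm_decline_per_year,
                        het_delay - msm_delay)
print(f"projected MSM count: {projected:.0f} cells/uL")
print(f"projected MSM R: {relative_reduction(projected, t_healthy):.1f}%")
```

Comparing the projected MSM value with the HET value at diagnosis then isolates the part of the difference that diagnosis delay cannot account for.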
Recall that MSM are predominantly infected by subtype B, whereas HET by B and C (Fig 1B). Subtype B is thought to be more virulent than subtype C [42,44]. Moreover, in the US, where subtype B is predominant in both MSM and HET [24,25], R_{T/F} was lower in MSM (Fig 3). In agreement, subtype B T/F viruses have been found to have higher fitness in HET than MSM [3]. The effect of subtype should thus have resulted in lower CD4 counts (higher R_{T/F}) in MSM than HET, a trend opposite of what is observed. We quantified the effect of subtype on R_{T/F} and found that a negligible portion (∼2 percentage points) of the difference in R_{T/F} between MSM and HET was attributable to subtype (Methods). We concluded therefore that subtype was not a factor confounding our inference above.
In both China and the US [57], MSM and HET have similar ethnicities, and yet R_{T/F} is lower in MSM than HET. In Europe (EU/EEA), while MSM are predominantly Caucasian, 30-35% of infected HET are of sub-Saharan African origin [19,27]. Accounting for baseline CD4 count differences across ethnicities in EU/EEA did not alter our findings (Table 1), suggesting minimal effects of ethnicity on R_{T/F}. The remaining factors, age and sex, could exert confounding effects. MSM are typically diagnosed at a younger age than HET [19,27]. The younger age could result in better ability to fight disease and hence lower R_{T/F} in MSM than HET. HIV-1 is known to progress differently in men and women, with women typically establishing lower SPVL but progressing faster to AIDS [58]. The latter could result in higher R_{T/F} in HET than MSM.
Table 3. Results of regression analysis. The change in R_{T/F} (in %) due to variation in each factor is quantified by the associated slope. Given the dummy coding for sex (0 for female, 1 for male), R_{T/F} is higher in males than females by 1.30 percentage points. Similarly, given the dummy coding for risk group (0 for MSM, 1 for HET), R_{T/F} is higher in HET than MSM by ∼18 percentage points. (See Methods for details).

To delineate the dependence of R_{T/F} on the risk group (MSM and HET) from that on the confounding factors, we performed regression analysis (Methods). We found a small effect of age and a negligible effect of sex on R_{T/F} (Table 3). These minimal effects are consistent with empirical observations. For instance, in the two European studies, MSM were 5 years [27] and 1.6 years [19] younger on average than HET at diagnosis (S5 and S7 Tables). Given the cell count decrease of ∼7 cells/μL per year of age at diagnosis [28], the early CD4 counts should have been higher in MSM by only ∼35 and ∼11 cells/μL, whereas they were higher by 135 and 143 cells/μL (Fig 2 and Table 1), respectively, a difference that could not be explained by the age at diagnosis. Similarly, healthy men had lower CD4 counts than HET and healthy women everywhere except China [32,35-39] (S6 Table), and infected HET men displayed higher R_{T/F} than MSM (Table 1), consistent with the lack of an effect of sex on the lower R_{T/F} in MSM. Nonetheless, the regression analysis indicated, after controlling for the effects of these latter factors, that R_{T/F} was higher in HET by ∼18 percentage points than MSM (Table 3). This was consistent with the difference in R_{T/F} we estimated between the groups across different studies (Table 1). We concluded therefore that this difference originated from the variations in the phenotype of the T/F strains in the two groups arising from the different selection biases at transmission.
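The age comparison in the preceding paragraph is a short piece of arithmetic, sketched below. All numbers are those quoted in the text (∼7 cells/μL per year of age at diagnosis; age gaps of 5 and 1.6 years; observed CD4 gaps of 135 and 143 cells/μL); the function and variable names are hypothetical.

```python
DECLINE_PER_YEAR_OF_AGE = 7.0  # cells/uL per year of age at diagnosis [28]


def age_explained_gap(age_gap_years, rate=DECLINE_PER_YEAR_OF_AGE):
    """CD4 count difference (cells/uL) expected from an age gap alone."""
    return rate * age_gap_years


# (age gap in years, observed MSM-minus-HET CD4 gap in cells/uL) per study.
studies = {
    "European study [27]": (5.0, 135.0),
    "European study [19]": (1.6, 143.0),
}

for name, (age_gap, observed_gap) in studies.items():
    expected = age_explained_gap(age_gap)
    unexplained = observed_gap - expected
    print(f"{name}: age explains ~{expected:.0f} of {observed_gap:.0f} cells/uL; "
          f"~{unexplained:.0f} remain unexplained")
```

In both studies the age gap accounts for only a minor share of the observed difference, which is the basis for treating age as a small effect.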

Discussion
The bottlenecks to the transmission of HIV-1 are expected to drive HIV-1 evolution and influence the design of prevention strategies [2]. The bottlenecks are affected by the mode of transmission. Because different risk groups tend to use different predominant modes of transmission, it is possible that the T/F strains of HIV-1, directly affected by the bottlenecks, may have evolved differently in the different groups. Indeed, evidence of genetic differences between the T/F strains in MSM and HET in small cohorts has been gathered [3]. Clinical manifestations of these differences at the wider population level, however, have remained elusive. Our study shows, for the first time, that the selection bias at transmission is an important underlying factor shaping HIV-1 adaptation at the population level. The reduction in CD4 cell counts early in infection was substantially higher in HET than MSM, consistent with the more stringent selection at transmission resulting in more virulent T/F strains in HET than MSM. Our inference is based on data from several large population studies, cutting across geographies, ethnicities, and calendar years, indicating its robustness and wide applicability. We focused on HET and MSM based on previous studies that suggested large differences in the selection bias at transmission in the two groups [7] as well as rare inter-mixing between the two groups [26]. We found strong support for the latter observation by examining the prevalence of HIV-1 subtypes in the two groups, which were vastly different in most geographies we considered. This implied that the differential adaptation of HIV-1 to MSM and HET may have been sustained and led over the years to the selection and, possibly, fixation of different adaptive mutations in the T/F viruses in the two groups, consistent with observations from small cohorts [3]. 
Future studies may establish these mutations at the population level, as sequencing technologies that allow facile identification of T/F viruses emerge. The technologies may also serve to elucidate such differences between other infected groups, which are likely to be present to lower degrees than between MSM and HET, depending on the differences in the selection bias between the groups, the exclusivity of the associated modes of transmission, and the extent of mixing between the groups.
An important theory of HIV-1 evolution at the population level is the adaptation of heritable traits such as SPVL for maximizing transmission. In a series of seminal articles, an intermediate SPVL has been argued, supported by data, to maximize transmission, striking a balance between increasing transmissibility and decreasing host survival with increasing SPVL [59,60]. If the different selection biases in HET and MSM were to lead to different dependencies of the transmissibility on SPVL, then the above theory would predict different values of the optimal SPVL in MSM and HET. From the studies we examined, a consistent trend in the differences in the SPVL values between MSM and HET did not emerge. It is possible, therefore, that the dependencies of transmissibility on SPVL may not be significantly different across MSM and HET. At the same time, previous studies have recognized that SPVL may not be the sole target of such evolutionary optimization [4,6,42]. For instance, HIV-1 subtype C has lower in vivo fitness than subtype B despite leading to higher SPVL [42,44]. Similarly, CD4 count decline appeared independent of the SPVL during the initial years post-infection [4,6]. The differential selection bias at transmission manifested in our study in the reduction in early CD4 counts. Future advances to the theory of HIV-1 evolution may thus consider selection acting on multiple traits affecting transmission.
More fundamentally, whether HIV-1 is evolving in humans to be more or less virulent with time remains an unresolved question. Different studies have argued that the virulence is increasing [61], stable [62], or decreasing [44] with time. A more recent study that examined trends in CD4 counts at seroconversion in the CASCADE cohorts found that the counts declined from the early 1980s and reached a plateau around 2002, following which significant evolution was not evident [41]. The causes of these trends have remained elusive. In our present study, 4 of the 6 datasets we examined, which contained all but 13,000 patients, belong to the post-2002 period (Table 1), where evolution of early CD4 counts is expected to be minimal, because of which we used counts averaged over the study periods in the respective datasets. An inference from our findings would thus be that the early CD4 counts in MSM and HET appear to have settled over the years to distinct plateau values. How the differential selection bias at transmission in the two groups, together with other potential evolutionary pressures, may have resulted in the different plateaus is an interesting question to address. Complex nested models [63,64], which couple within-host and population-level evolution of the virus, would have to be developed, perhaps incorporating selection over multiple heritable traits, to help deduce the causes of the observed evolutionary trends.
Another aspect that warrants further investigation is the influence of mixing between groups. Although evidence of minimal mixing between MSM and HET exists, including from the comprehensive data of subtype prevalences that we have collated as part of the present study, the effect that the limited mixing may have on the differences in early CD4 counts, and hence R_{T/F}, between the groups remains to be quantified. Specifically, how a strain circulating in MSM, for instance, would behave when transmitted to HET, due to mixing, and vice versa, would be interesting to examine. Detailed natural history of the HIV epidemic in select countries provides some insights [65,66]. If the population density is small, most transmission events may not lead to the establishment of an epidemic; indeed, one of approximately 25 transmission events is estimated to have resulted in a sustained epidemic in Greenland [65]. The intriguing notion of source-sink dynamics, developed first for ecological settings, has recently been proposed as relevant to natural HIV infection to assess how epidemics are sustained in geographically separated populations with diverse population densities [67]. Here, a region of high population density with a sustained epidemic can ensure the sustenance of the epidemic in a region of low population density via migration. We envision such approaches being adapted to assess the effect of mixing between risk groups. We recognize though that such models will necessarily be more involved than models aimed at describing the effects of geographical heterogeneity in population density. In the latter case, the virus and host traits are identical. With risk groups and possibly different phenotypic T/F strains, viral evolution will have to be superimposed on a spatial epidemic model.
The less stringent transmission bottleneck associated with anal transmission than penile-vaginal transmission results not only in less virulent T/F strains but may also allow a greater number of T/F viruses to establish infection in MSM than HET [68]. (Evidence that the number is not greater also exists [3].) The implications of the greater number of T/F strains remain to be elucidated. We speculate that its effect on early CD4 count decline is likely to be minimal because the latter represents a rapidly established balance between viral replication and immune control before much viral adaptation to the host. Thus, we expect our estimates of $R_{T/F}$ to remain robust to variations in the number of T/F virions. On the other hand, the subsequent decline of CD4 counts leading to AIDS may be affected by these variations. A greater number of T/F virions may imply greater genomic diversity, which may allow easier immune escape, including via recombination [69,70], and expedite CD4 decline. Indeed, in some studies, MSM have been observed to exhibit faster CD4 decline despite higher initial CD4 counts than HET [29,40].

PLOS PATHOGENS
Virulence of transmitted/founder HIV-1 in different risk groups
A limitation of our study is the use of population-level rather than individual-level data. Currently, no dataset reports measurements of all the key factors involved at the individual level in large patient cohorts. Future studies may fill this gap. Nonetheless, we recognized, based on strong independent evidence, that controlling for the effects of these factors would not substantially change the effect of risk group. Indeed, multivariable regression that controlled for confounding factors did establish the effect of risk group on early CD4 count decline. We therefore expect our findings to be robust.
In summary, our study presents the first large-scale evidence of a clinical manifestation of the selection bias during HIV-1 transmission, with implications for our understanding of HIV-1 pathogenesis, evolution, and epidemiology.

Data of CD4 counts
We collated data from all large studies (n ≳ 1,000) that reported CD4 counts either at diagnosis or seroconversion in HET and MSM (Table 1 and S3-S5 Tables). From reports on countries in the EU/EEA (for the combined set of 21 countries) and China [27,31], we digitized the median CD4 counts using WebPlotDigitizer (https://automeris.io/WebPlotDigitizer). For our analysis, we averaged the data over the study duration. To obtain sample sizes, we multiplied the number of diagnosed cases by the reported fraction of diagnoses, available in the annual surveillance reports [71]. The fraction was assumed to be the same across the risk groups and countries. To obtain the population-weighted average CD4 counts, we assumed that the proportions of the populations in the different transmission categories were the same across age groups and that the fractions of men and women remained conserved (except in MSM and hemophiliacs). To calculate $R_{T/F}$, we collated data of CD4 counts from healthy, uninfected adults in the USA, UK, Italy (which was used for the three studies involving European populations), Tanzania, and China (S6 Table). For $R_{T/F}$ calculations pertaining to the UK, CD4 counts from healthy MSM and HET were available, which we used. We found the counts in healthy MSM comparable to those from healthy HET men (P = 0.22; see S6 Table). As a result, for other populations, we used the cell counts for healthy HET men when counts from healthy MSM were unavailable.
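As a rough illustration of how $R_{T/F}$ follows from an early CD4 count and a healthy baseline, the sketch below uses the definition stated in the abstract (the reduction in CD4 count from baseline divided by the reduction associated with AIDS). The AIDS-associated count of 200 cells/μL, the function name `r_tf`, and the baseline of 903 cells/μL in the usage lines are assumptions made for illustration; they are not taken from S6 Table.

```python
AIDS_CD4 = 200.0  # cells/uL; assumed AIDS-associated count (not stated in the text)


def r_tf(early_cd4, healthy_cd4, aids_cd4=AIDS_CD4):
    """Percent progression to AIDS implied by an early CD4 count.

    Assumed form: 100 * (baseline - early) / (baseline - AIDS count).
    """
    if healthy_cd4 <= aids_cd4:
        raise ValueError("healthy baseline must exceed the AIDS-associated count")
    return 100.0 * (healthy_cd4 - early_cd4) / (healthy_cd4 - aids_cd4)


# Hypothetical healthy baseline, chosen only for illustration.
baseline = 903.0
print(round(r_tf(561.0, baseline), 1))  # 48.6
print(round(r_tf(412.0, baseline), 1))  # 69.8
```

Under these assumptions, early counts of 561 and 412 cells/μL map to roughly 48.6% and 69.8%, which agrees with the seroconversion values reported for the EU/EEA cohort in the multivariable regression analysis.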

Estimation of centrality measures
When the median, $m$, and interquartile range (IQR), $(q_l, q_u)$, of CD4 counts (or other quantities) were available, we estimated the corresponding mean, $\mu$, and standard deviation (SD), $\sigma$, using $\mu = (m + q_u + q_l)/3$ and $\sigma = (q_u - q_l)/1.35$, following a widely used method [72] applicable to large sample sizes, as considered here. When 95% confidence intervals (CIs), $(c_l, c_u)$, were available instead of IQR, we evaluated SD using another method [73], which yielded $\sigma = \sqrt{n}\,(c_u - c_l)/3.92$ when the sample size n ≳ 100. When CIs too were unavailable, we approximated the medians as the means, assuming the distributions to be normal. For data from China, where $\sigma$ was available for the overall population, we estimated $\sigma$ for MSM and HET using the proportion of the total $\sigma$ attributed to the two risk groups [30]. When information necessary to estimate $\sigma$ was unavailable, we used the highest $\sigma$ from related datasets. The highest $\sigma$ yielded an upper bound on the associated P value. To estimate the SD of $R_{T/F}$, we employed the error propagation equation [74] and derived $\sigma_{R_{T/F}}$ [equation not recoverable from the extracted text; see Table 1 and S6 Table]. For $\sigma_{R_{T/F}}$ of all the data combined, we chose $\sigma$ from the CASCADE study. Finally, for CD4 counts and $R_{T/F}$, we calculated the 95% CIs, shown in Tables 1 and 2, from the SDs.
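The conversions described above can be sketched as follows. The first two functions implement the median/IQR and large-sample 95% CI relations. The third is a first-order error propagation for $R_{T/F}$ written under an assumed definition, $R_{T/F} = 100\,(C_h - C)/(C_h - 200)$, with independent errors in the early count $C$ and the healthy baseline $C_h$; it is not the authors' expression, which could not be recovered here. All function names and input values are hypothetical.

```python
import math


def mean_sd_from_median_iqr(m, q_l, q_u):
    """Approximate mean and SD from a median and IQR (large samples)."""
    mu = (m + q_u + q_l) / 3.0
    sigma = (q_u - q_l) / 1.35
    return mu, sigma


def sd_from_ci95(c_l, c_u, n):
    """SD from a 95% CI of the mean, normal approximation (n of about 100 or more)."""
    return math.sqrt(n) * (c_u - c_l) / 3.92


def sd_r_tf(early, sd_early, healthy, sd_healthy, aids=200.0):
    """First-order SD of R = 100*(healthy - early)/(healthy - aids).

    Assumes independent errors; the form of R and the value of `aids`
    are assumptions of this sketch.
    """
    span = healthy - aids
    d_early = -100.0 / span                        # dR/d(early)
    d_healthy = 100.0 * (early - aids) / span**2   # dR/d(healthy)
    return math.sqrt((d_early * sd_early) ** 2 + (d_healthy * sd_healthy) ** 2)


# Hypothetical inputs, for illustration only.
mu, sigma = mean_sd_from_median_iqr(500.0, 350.0, 650.0)
print(mu, round(sigma, 1))
print(round(sd_from_ci95(490.0, 510.0, 400), 1))
```

The propagation step treats the two SDs as uncorrelated, which seems reasonable when the infected and healthy counts come from separate cohorts.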

Data of set-point viral loads
From among the studies that reported early CD4 counts in MSM and HET above, we collated corresponding data of SPVL where available (Fig 2). Data were available for MSM and the full population in 2003-05 and 2006-09 in the CASCADE cohorts [41], and for MSM and HET in the European study from 2002-2007 [19]. For the former, we obtained $\mu$ for MSM and non-MSM, with the latter containing ∼80% HET. Using the early CD4 counts in the various groups, we estimated $R_{T/F}$.

Estimation of per parasite pathogenicity
To estimate the relative per-parasite pathogenicity, $P_{T/F}$, we adopted the procedure developed earlier [10,11]. Accordingly, one would deduce the dependence of $R_{T/F}$ on SPVL, or tolerance, in one group. Using this dependence, the $R_{T/F}$ of the group at the SPVL of the other group would be predicted. The deviation of the predicted $R_{T/F}$ from that measured in the other group would yield the relative $P_{T/F}$. This procedure would thus quantify the effect of the T/F strain on $R_{T/F}$ beyond that expected from SPVL. As the SPVL were similar in MSM and HET, it followed that the difference in the measured $R_{T/F}$ between the groups would yield accurate estimates of $P_{T/F}$. We ascertained this by also following the above procedure explicitly. We estimated the tolerance of MSM as $\alpha = R_{T/F}^{\mathrm{MSM}} / (\mathrm{SPVL}^{\mathrm{MSM}})^2$, assuming that CD4 decline (and hence $R_{T/F}$) was proportional to the square of SPVL [10,11]. Assuming a linear dependence instead did not alter our conclusions. The per-parasite pathogenicity [11] was then expressed as a percentage [equation not recoverable from the extracted text]. We found that the contributions varied from ∼10-25%, which were similar to the values obtained by assuming that the SPVL were identical in the two groups (Table 2).
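A minimal sketch of this procedure is given below, assuming $R_{T/F} = \alpha \cdot \mathrm{SPVL}^p$ with $p = 2$ (or $p = 1$ for the linear alternative). The function names are hypothetical, and the normalization of the excess by the measured $R_{T/F}$ is an assumption of the sketch, since the original expression could not be recovered. The $R_{T/F}$ inputs are the average values quoted in the abstract; the SPVL of 4.5 log10 copies/mL is a made-up value.

```python
def tolerance(r_tf, spvl, power=2):
    """Tolerance coefficient alpha, assuming R_T/F = alpha * SPVL**power."""
    return r_tf / spvl**power


def pathogenicity_excess(r_ref, spvl_ref, r_other, spvl_other, power=2):
    """Excess R_T/F in the other group beyond that predicted from SPVL.

    Returns (predicted R, excess in percentage points, excess as a
    percentage of the measured R). The last normalization is an
    assumption of this sketch.
    """
    alpha = tolerance(r_ref, spvl_ref, power)
    predicted = alpha * spvl_other**power
    excess = r_other - predicted
    return predicted, excess, 100.0 * excess / r_other


# MSM as reference group, HET as the other group; equal hypothetical SPVL.
pred, excess, rel = pathogenicity_excess(68.0, 4.5, 87.0, 4.5)
print(round(pred, 1), round(excess, 1), round(rel, 1))
```

When the two SPVL values coincide, the prediction equals the reference group's $R_{T/F}$ regardless of `power`, so the excess reduces to the plain difference between the groups, in line with the argument in the text.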

Data of HIV-1 subtype prevalence
To assess the extent of mixing between MSM and HET, we collated data of the prevalence of HIV-1 subtypes in the two groups across relevant geographical regions and calendar years. The data are summarized in Fig 1B and 1C.

Multivariable regression
We performed multivariable regression using the linear model [78,79] to estimate the effects of age, sex, and risk group on the outcome ($R_{T/F}$). Accordingly, we wrote

$$R_{T/F} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \qquad (1)$$

where $\beta_1$, $\beta_2$, and $\beta_3$ are the effects of age, sex, and risk group, respectively, on $R_{T/F}$, and $\beta_0$ accounts for factors not considered. $X_1$ is the mean age, and $X_2$ and $X_3$ are categorical variables representing sex and risk group, respectively. We set $X_2 = 0$ for female and $X_2 = 1$ for male, and $X_3 = 0$ for MSM and $X_3 = 1$ for HET. To estimate $\beta_1$, $\beta_2$, and $\beta_3$, we adopted the following procedure.
We first considered the CASCADE 1979-00 seroconverter (SC) cohorts [29], which have reported data by categories aged <40 y and >40 y (Table 1). To find the mean age in the two categories, $X_{1,<40}$ and $X_{1,>40}$, we employed the reported mean age $X_1$ for the overall sample of size $n$, and the reported sample sizes in the two categories, $n_{<40}$ and $n_{>40}$, and wrote $n X_1 = n_{<40} X_{1,<40} + n_{>40} X_{1,>40}$. Further, we let $X_{1,<40}$ and $X_{1,>40}$ be equidistant from 40, so that $X_{1,<40} + X_{1,>40} = 2 \times 40$. We solved these two equations and obtained $X_{1,<40}$ and $X_{1,>40}$. We applied this process separately for MSM, HET men and HET women. The data are collated in S7 Table. By subtracting Eq (1) applied to groups above and below 40 years, but within the same risk and sex groups, we obtained equations in $\beta_1$. For instance, considering MSM (so that $X_2 = 1$ and $X_3 = 0$), we obtained $R_{T/F}^{>40} - R_{T/F}^{<40} = \beta_1 (X_{1,>40} - X_{1,<40})$. Using population-weighted averages (equivalent to a best fit) across MSM, HET men and HET women in the CASCADE cohorts, we obtained $\beta_1$. For the individual groups, namely MSM (n = 2,941), HET men (n = 490), and HET women (n = 399), we obtained $\beta_1$ to be 0.25%, 0.25%, and 0.16% per year, respectively, yielding the average $\beta_1$ = 0.24% per year.
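The age-splitting step and the weighted average of $\beta_1$ can be sketched as below. The function names and the inputs to `split_mean_ages` are hypothetical, and the sketch assumes that the overall sample size is the sum of the two age categories. The group-wise $\beta_1$ values and sample sizes in the last lines are those quoted above.

```python
def split_mean_ages(mean_age, n_below, n_above, cutoff=40.0):
    """Solve n*X1 = n_below*x_below + n_above*x_above together with
    x_below + x_above = 2*cutoff for the two category means."""
    if n_below == n_above:
        raise ValueError("system is singular when the two categories are equal in size")
    n = n_below + n_above  # assumed: the overall sample is the two categories combined
    x_below = (n * mean_age - 2.0 * cutoff * n_above) / (n_below - n_above)
    return x_below, 2.0 * cutoff - x_below


def beta1_from_age_groups(r_above, r_below, x_above, x_below):
    """Age effect from Eq (1) differenced within one sex and risk group."""
    return (r_above - r_below) / (x_above - x_below)


def weighted_mean(values, weights):
    """Sample-size-weighted average."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical cohort: overall mean age 32 y, 2000 aged <40 y and 500 aged >40 y.
x_lt, x_gt = split_mean_ages(32.0, 2000, 500)
print(round(x_lt, 1), round(x_gt, 1))

# beta_1 per group (MSM, HET men, HET women) with the sample sizes quoted above.
print(round(weighted_mean([0.25, 0.25, 0.16], [2941, 490, 399]), 2))  # 0.24
```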
To examine the effect of sex, we again considered the CASCADE 1979-00 seroconverter (SC) cohorts [29], but now examined patients aged <40 years among HET men and women and patients aged >40 years in the same groups. Applying Eq (1) to the <40 years groups and subtracting yielded $R_{T/F}^{\mathrm{men}} - R_{T/F}^{\mathrm{women}} = \beta_1 (X_{1,\mathrm{men}} - X_{1,\mathrm{women}}) + \beta_2$. Using $\beta_1$ estimated above, this equation can be solved for $\beta_2$. A similar procedure was followed for groups aged >40 years. We obtained $\beta_2$ for groups aged >40 years (n = 112) and <40 years (n = 777) to be 1.82% and 1.22%, respectively, yielding a population-weighted average $\beta_2$ of 1.3%.
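For illustration, the sex effect can be isolated as in the short sketch below. The function name and the inputs in the first usage line are hypothetical; the final lines reproduce the population-weighted average from the two age-group estimates quoted above.

```python
def beta2_from_sex_groups(r_men, r_women, x_men, x_women, beta1):
    """Sex effect from Eq (1) differenced between HET men and HET women
    of the same age category, after removing the age contribution."""
    return (r_men - r_women) - beta1 * (x_men - x_women)


# Hypothetical R_T/F (%) and mean ages (years) for one age category.
print(round(beta2_from_sex_groups(66.0, 64.5, 33.0, 32.0, beta1=0.24), 2))

# Population-weighted average of the reported estimates: (beta_2 in %, n).
estimates = {">40 y": (1.82, 112), "<40 y": (1.22, 777)}
total = sum(n for _, n in estimates.values())
beta2 = sum(b * n for b, n in estimates.values()) / total
print(round(beta2, 1))  # 1.3
```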
To estimate $\beta_3$, we first chose the Chinese cohort, where only age and sex would differ across HET and MSM. (Subtype is not a confounding factor because subtype CRF01_AE, which infects >50% of MSM and ∼40% of HET in China [21], is associated with a lower CD4 cell count than subtypes CRF07_BC and B [80] and higher viral load in MSM [80].) In the Chinese population examined (n = 218,039), ∼30.4% (n = 66,262) were women [31], yielding the fraction of women among HET of $f_w$ ∼ 36.3%. The overall $R_{T/F}$ in HET would obey

$$R_{T/F}^{\mathrm{HET}} = f_w R_{T/F}^{\mathrm{HET\,women}} + (1 - f_w) R_{T/F}^{\mathrm{HET\,men}} \qquad (2)$$

Recognizing that $R_{T/F}^{\mathrm{HET\,men}} = \beta_2 + R_{T/F}^{\mathrm{HET\,women}}$, using Eq (2), we obtained

$$R_{T/F}^{\mathrm{HET\,women}} = R_{T/F}^{\mathrm{HET}} - (1 - f_w)\beta_2, \qquad R_{T/F}^{\mathrm{HET\,men}} = R_{T/F}^{\mathrm{HET}} + f_w \beta_2 \qquad (3)$$

(see S7 Table). MSM are generally younger than HET by 2-5 years. Accounting for this age difference, we estimated $\beta_3$ = 18% (n = 126,643).
We next repeated the above exercise with the US cohort. Since ethnicities are comparable in HIV-1 infected MSM and HET in the US [57], these two groups at seroconversion differ only by sex. Using $f_w$ ∼ 64.0% [30] in Eqs (2) and (3) along with $R_{T/F}^{\mathrm{HET}}$ = 64.7% (S7 Table), we obtained $R_{T/F}^{\mathrm{HET\,men}}$ = 65.5% and $R_{T/F}^{\mathrm{HET\,women}}$ = 64.0% at seroconversion. Given that $R_{T/F}^{\mathrm{MSM}}$ = 51% (S7 Table), it followed that $\beta_3$ = 14.5% (n = 9,286), similar to the estimate from China above. Taking the sample-size-weighted average of the estimates from China and the US, we calculated an overall $\beta_3$ = 17.8%, explaining the higher virulence of T/F strains in HET than MSM after controlling for confounding factors.
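The US calculation can be retraced with the sketch below, which splits the overall HET value into men and women by combining the mixture relation for $f_w$ with the sex effect $\beta_2$, and then forms the sample-size-weighted $\beta_3$. The function name is hypothetical, and $\beta_2$ = 1.3 is the rounded value from the preceding step. The men's value and $\beta_3$ agree with the text to the quoted precision. The women's value comes out near 64.2%, slightly above the 64.0% quoted, and the reason for that difference is not clear from the text.

```python
def split_het_by_sex(r_het, f_women, beta2):
    """Split the overall HET R_T/F into men and women, using
    r_het = f_women*r_women + (1 - f_women)*r_men and r_men = r_women + beta2."""
    r_women = r_het - (1.0 - f_women) * beta2
    r_men = r_women + beta2
    return r_men, r_women


# US cohort values quoted in the text (percent); beta2 rounded to 1.3.
r_men, r_women = split_het_by_sex(r_het=64.7, f_women=0.64, beta2=1.3)
beta3_us = r_men - 51.0  # HET men minus MSM, the two groups differing only by risk group

# Sample-size-weighted average with the estimate from China (18%, n = 126,643).
n_china, n_us = 126_643, 9_286
beta3 = (18.0 * n_china + beta3_us * n_us) / (n_china + n_us)
print(round(r_men, 1), round(beta3_us, 1), round(beta3, 1))  # 65.5 14.5 17.8
```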

In the above calculations/cohorts, HIV-1 subtype was not a confounding factor. To estimate the potential contribution from subtype, we extended the regression analysis to include subtype and applied it to the EU/EEA cohort, where subtypes differ between MSM and HET. We thus wrote $R_{T/F} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$, where $\beta_4$ is the effect of subtype on $R_{T/F}$ and $X_4$ is a categorical variable representing subtype. We set $X_4 = 0$ for subtype B, dominant in MSM, and $X_4 = 1$ for the collective of subtypes present in HET. The other terms were the same as in Eq (1). We used the diagnosis delays of 3 years for MSM and 4.9 years for HET men [81] and, following the procedure above, estimated the CD4 counts at seroconversion to be 412 cells/μL in HET men ($R_{T/F}$ = 69.8%) and 561 cells/μL in MSM ($R_{T/F}$ = 48.6%). Using estimates of $\beta_1$ and $\beta_3$ above together with the latter estimates of $R_{T/F}$, we calculated $\beta_4$ = 2.2 percentage points, indicating a negligible effect of subtype on $R_{T/F}$.

Supporting information

S1 […] [27]. The cell counts were estimated using WebPlotDigitizer (https://automeris.io/WebPlotDigitizer). The median ages, where available, are provided. The last row provides the mean cell counts (with SDs) and total numbers of MSM, HET men and women, respectively, estimated as in Methods. (PDF)

S6 Table. CD4 T cell counts in healthy adults. Mean CD4 counts in healthy adults from different population groups which define baseline counts for estimating the relative reduction in early cell count following HIV-1 infection. Sample sizes are in brackets. SD is standard deviation. (PDF)

S7 […] groups in addition to their combined set, as with the CASCADE 1979-00 datasets. For the continuous variables, the mean values are given where available. The diagnosis delay is expressed in years, whereas the viral load (SPVL) is in log10 copies/mL. A diagnosis delay of 0.25 years is used for seroconversion. Details regarding the estimation of age are given in the text. The rows titled CASCADE 2003-05, 2006-09, and 2003-09 contain the groups that were mentioned in detail in Table 2 in the context of viral load data. The last four rows are created using the information on age in S5 Table. (PDF)

S8 Table. Association of MSM and HET with transmission clusters. The percentages of individuals found to be associated with transmission clusters in MSM and HET in several studies are collated. In the second column, the numbers in parentheses indicate sample sizes examined. Where available, P values and the largest cluster sizes are indicated as other details. (PDF)

29. CASCADE Collaboration. Differences in CD4 cell counts at seroconversion and decline among 5739 HIV-1-infected individuals with well-estimated dates of seroconversion. J. Acquir. Immune Defic. Syndr. 34, 76-83 (2003).
